ABSTRACT

The construction of parallel editions of conventional 
tests for purposes of test security while maintaining score 
comparability has always been a recognized and difficult problem in 
psychometrics and test construction. The introduction of new modes of 
test construction, e.g., adaptive testing, changes the nature of the 
problem, but does not make it disappear. Items in adaptive testing 
pools may become overused and require replacement. However, in order 
to ensure score comparability, important characteristics of the pool
must remain constant. Three methods of selecting candidate new items
and three methods of identifying items for replacement are developed 
and compared with each other and with a previous method through a 
simulation study. Results indicated that using conventional item 
statistics to screen items before deciding to seed them was important
and effective in terms of maintaining the information structure of
the adaptive test item pool. The online calibration of larger sets of
seeded items from which to select replacements can substantially 
improve the ease with which the information structure of the pool can 
be maintained. (Contains 1 table, 11 figures, and 4 references.) 
Some Considerations in Maintaining Adaptive Test Item Pools

Introduction 

Test development specialists and psychometricians have long 
struggled with the problems associated with the construction of 
parallel editions of a single conventional test. The decision to
issue a new test edition is usually based on the desire to 
preserve test security by preventing overexposure of test 
editions. Typically, all items in a conventional test are 
replaced by a new set of items that conform to the same content 
and statistical specifications as the original test edition. To 
compensate for any remaining differences between the new and 
original test editions, statistical procedures are usually 
employed to ensure that scores resulting from the administration
of either test edition have the same interpretation.

New advances in psychometrics and computer technology 
encourage individualized (adaptive) testing on a microcomputer,
where each examinee is administered a small set of items drawn
from a larger item pool. Using a possibly very complex set of 
decision rules, examinees may receive completely different sets of 
items. Two issues immediately arise in this context. First, in 
order to make examinee scores comparable on different sets of 
items, measures must be taken to control the content and 
statistical properties of the item sets appropriately. Second,
when faced with decisions to replace overexposed items in the item
pool from which the individualized tests are drawn, care must be
taken to ensure that the characteristics of the item pool remain
as nearly constant as possible, so that the accuracy of estimated 
adaptive test scores remains the same across various editions of 
the item pool. Issues surrounding this latter topic are addressed 
in this paper. 

The next section describes an idealized setting for adaptive 
testing as a context for some practical constraints. A convenient 
method of analyzing and comparing certain features of item pools 
is detailed in the following section. Remaining sections of this 
paper will describe a particular practical problem in maintaining 
adaptive test item pools, and some potential solutions to this 
problem. An investigation of the efficacy of these solutions when 
applied to simulated data is described, and the results discussed. 

An Idealized Setting and Some Practical Constraints 

The major psychometric appeal of adaptive testing is the 
promise of equally precise measurement of all examinees,
regardless of their ability levels. Aside from the details of a 
particular adaptive testing algorithm, the promise of equal 
measurement precision rests on certain strong assumptions about 
the item pool. The first assumption made is that it is possible
to obtain sufficient numbers of items appropriate for all ability
levels. Secondly, it is assumed that the 'appropriateness' of an
item is related to the precision with which a particular item will 
measure an examinee with a particular level of ability. The third 
assumption made is that the set of items appropriate for a 
particular level of ability represents a certain average level of 
precision, and that this precision remains constant across 
examinee ability levels. In the circumstances considered in this 
paper in which items in the pool must be replaced from time to 
time, it is further assumed that the replacement items are 
psychometrically equivalent to the items being discarded.

Thus, in an idealized setting in which the goal of testing is 
to measure all abilities with equal precision, the ideal item pool 
consists of sufficient numbers of items whose measure of precision 
follows a rectangular distribution across the entire ability range 
to be measured. Further, in this setting, the psychometric 
properties of this ideal item pool are not affected by the process 
of discarding some items and replacing them with others. Given 
sufficiently expert item writers, with sufficient time and money 
to complete many cycles of item writing and pretesting, it is 
possible that this ideal situation could be realized.

However, in practice, many compromises are made:

1) The abilities of interest are restricted to some finite range. This
automatically decreases the necessary item production effort by
denoting ability levels outside the specified range as
unimportant.

2) The size of the item pool is limited. The limit is 
determined not only by the numbers of items required for adaptive 
tests of various lengths, but also by the computer resources 
required for item storage and display. 

3) In the production of items for the pool, only a finite 
number of cycles of item writing and pretesting are conducted. 

Thus the item pool will consist of the best items that could be 
produced for a certain fixed cost. It is unlikely that such a 
pool can contain sufficient numbers of appropriate items, even for 
the abilities within the restricted range of interest. Thus a 
further compromise is required -- to measure some ability levels
with more precision than other ability levels. 

4) If the adaptive test is administered to a group of 
examinees whose distribution of ability is bell shaped, the items 
most vulnerable to overexposure, for commonly used item selection 
algorithms, are those that are most appropriate for the average 
examinee. In the production of replacement items, only a finite 
number of cycles of item writing and pretesting are conducted, as
before. Even with the most sophisticated item writers, it is
unlikely that this production effort can be sufficiently narrowly
focused to result in an adequate number of items that are
psychometrically equivalent to those items most appropriate for
the average examinee. Thus another compromise -- the psychometric
properties of the item pool may change over cycles of item pool
refreshment.

5) Some items in the item pool may be appropriate for such 
extreme ability levels that they are infrequently, and sometimes 
never, administered when the adaptive test is given to finite 
samples of examinees. This naturally leads to the consideration 
of removing these items, to gain more room in an item pool of
fixed size for items that are appropriate for more typical 
examinees. In the real-world situation, where items are 
appropriate at more than a single level of ability, this can be a 
mechanism for increasing precision at typical levels of ability at 
the sacrifice of precision at more extreme levels of ability. 

This results in yet another compromise -- the 'effective' range of 
the abilities of interest is shrunk. 

The issues addressed in this paper arise in the context of 
the constraints and compromises imposed by the process of moving 
adaptive testing out of the theoretical realm and into the
practical realm. 

A Convenient Method of Analyzing Certain Item Pool Features

The adaptive test algorithm used in this paper, as well as
most adaptive testing algorithms in current use, rests on modern
model-based psychometrics such as Item Response Theory (IRT). In
IRT, one way of characterizing the precision with which an item
measures an ability is by the item information function (Lord,
1980, equation 5-9). The information structure for a collection
of items can be characterized by the test information function
(Lord, 1980, equation 5-6), which is formed by taking the simple
sum, at different abilities, of the values of the item information
functions. This test information function is the maximum amount 
of information that can be obtained from the item set if it were 
administered as a conventional test. 

It bears emphasizing that the test information for an
adaptive test item pool is not the information function for an 
adaptive test using this item pool. The adaptive test information 
function depends upon the items actually taken by examinees. This 
is determined not only by the information structure of the item 
pool, but also by the details of the algorithm such as those that 
specify the selection of the first and subsequent items for 
administration, randomization of item selection to increase item
security, the rule used to stop item administration, and the
method of scoring the adaptive test. The adaptive test
information function for algorithms of the type used here can only
be conveniently estimated from numerical approximations using
Monte Carlo results (see, for example, Lord, 1980, section 10.6).

In this discussion, the estimated test information function 
is viewed as a convenient mechanism for discovering changes in the 
information structure of the item pool upon which the adaptive 
testing algorithm will operate. This estimated test information 
function is obtained by substituting estimated, rather than true, 
parameters into Lord's equation, and is the only test information 
that is computable in practical applications where true parameters 
are unknown. In the context of the idealized setting discussed
previously, the optimum item pool, in terms of an information
measure, would have constant estimated test information across all
ability levels, and would not change as items are discarded and
replaced.

A Practical Problem in Item Pool Maintenance

A number of agencies of the Department of Defense recently
funded a three-year project to develop and evaluate different
methods of on-line calibration for the computerized adaptive Armed
Services Vocational Aptitude Battery (CAT-ASVAB) (Bock, Davis,
Holland, Levine, Samejima, & Stocking, 1988). On-line calibration
methods are procedures to obtain parameter estimates for new items
that are candidates for inclusion in subsequent item pools from
data collected during an examinee's testing session (on-line).
For this particular project the final parameter estimates were
constrained to be based on the 3-parameter logistic model of item
response functions (Lord, 1980, equation 2-1). As part of this
project, a method of on-line calibration based on the estimation
procedures in the LOGIST computer program (Wingersky, 1983) was
explored by the author; Bock, Levine, and Samejima developed other
methods.

In on-line calibration, each examinee is administered 
(seeded) a small number of items that are candidates for inclusion 
in the next version of the item pool. In the LOGIST-based method, 
examinees are also administered a small number of 'anchor' items. 
These anchor items are not part of the adaptive test item pool, 
although they have well-determined parameter estimates that are on 
the same metric as those of the item pool. The responses to 
neither the seeded items nor the anchor items are used in the
operation of the adaptive test algorithm. In the LOGIST-based
method of on-line calibration, the responses to items administered
in the adaptive test are used to compute a maximum likelihood
estimate of examinee ability. The item responses to the seeded
items and the anchor items are used, along with these ability
estimates, to obtain parameter estimates for the seeded items and
to reestimate the parameters for the anchor items. The two sets
of parameter estimates for the anchor items, the original set on 
the scale of the item pool and those resulting from the on-line 
response data collection, are used to develop a scaling 
transformation that places the parameter estimates for the seeded 
items onto the metric of the adaptive test item pool. 

The final phase of the On-line Calibration project consisted 
of a sequence of four simulations of adaptive testing and item 
pool refreshment for each method of on-line calibration. The 
generating (or true) item response functions used in the 
simulations were nonparametric (and frequently nonmonotonic) 
functions developed by Levine (Bock et al., 1988). All simulated
examinees (simulees) were drawn from a bell-shaped distribution of
true ability also generated by Levine (Bock et al., 1988). Davis
(Bock et al., 1988) selected seeded items, conducted all
simulations of adaptive testing and the collection of data on the
seeded items, and identified items already in the pool to be
replaced. Individual experimenters were responsible for the on-line
calibration of seeded items and the selection of a subset of
these to replace those items to be discarded from the pool.

Starting with an initial item pool (called the Round 0 pool),
adaptive testing was simulated using this pool; responses to 
seeded items were collected simultaneously. Items were then 
identified for elimination from the pool, and, for the LOGIST- 
based method, replacement items were selected from the seeded new 
items to maintain an item pool of constant size with an 
information function similar to that of the Round 0 pool. This 
was considered to be the first 'Round' of adaptive testing and 
item pool refreshment. The second Round proceeded using the 
refreshed pool created during the first Round; the third Round 
used the refreshed pool from the second Round; and the fourth and 
final Round used the refreshed pool from the third Round. 

During the progress of these simulations, it became apparent 
that the rule employed for the selection of candidate new items 
for seeding and the rule for the elimination of old items from the 
pool had important impacts on the information structure of 
subsequent item pools. The original item pool consisted of 100 
items that were selected on the basis of estimated information 
from a collection of 258 5-choice items. At each Round (of four) 
of adaptive testing and item pool refreshment, a set of 50 items 
to seed was obtained by random selection from the collection of
258 items. Also at each Round, the 25 items already in the pool 
that received the highest number of administrations in the 
adaptive test simulations, accumulated across the current and all 
previous Rounds, were designated as items that must be replaced by 
selecting 25 (half) of the seeded items. 

Eliminating the 25 items most frequently used in adaptive 
test simulations, where simulees were drawn from a typical 
distribution of true ability, resulted in the elimination of 25 
middle difficulty items with good discriminations and low guessing 
parameters in Round 1. The attempt to replace the eliminated 
items by selecting half of the 50 seeded items resulted in an 
initial large decrease in estimated test information for the item 
pool at middle ability levels on the first Round, and small 
fluctuations around this initial decrease in subsequent Rounds. 
Figure 1 shows the estimated test information functions for the 
item pool at each Round for the LOGIST-based method of on-line 
calibration. 



Insert Figure 1 about here

By changing the rules used to select items for seeding and
for elimination from the pool, it should be possible to produce 
less dramatic changes in the information structure of the item 
pool. This study tries out three selection rules and three 
elimination rules. 



During the previous simulations, seeded items were randomly 
selected from the collection of 258 items. In every Round, the 
25-item set selected as replacement items was nearly as good, in
terms of estimated test information for middle ability levels, as 
the complete set of 50 candidate items. Figure 2 shows the 
estimated test information functions for the set of 50 seeded 
items and the 25 replacement items selected from them for the 
refreshment of the Round 0 pool. These results are typical of 
other Rounds. For improvements in the process for middle ability 
levels, then, we need to improve the quality of the items selected 
for seeding. 



In practice, items should not be considered as candidates for 
an adaptive test item pool until some rough idea has been obtained 
as to their quality. A reasonable approach would be to gather
some conventional statistics on such items for screening purposes. 
The three rules proposed here utilize the conventional 
proportions-correct and r-biserials. 
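As a rough illustration of these two conventional statistics, the sketch below computes a proportion-correct and a biserial correlation from scored responses. It assumes the standard biserial formula and uses a criterion score (for example, a total test score) chosen by the analyst; the paper does not say which criterion was used, and the function names are hypothetical.

```python
from statistics import NormalDist, fmean, pstdev

def proportion_correct(responses):
    """Proportion of 1s among 0/1 scored responses to one item."""
    return fmean(responses)

def r_biserial(responses, criterion):
    """Biserial correlation between a 0/1 item score and a criterion score.

    Uses the standard formula ((M1 - M0) / s) * (p * q / y), where y is
    the standard normal ordinate at the point cutting off proportion p.
    """
    p = fmean(responses)
    q = 1.0 - p
    if p == 0.0 or p == 1.0:
        raise ValueError("biserial is undefined when all responses agree")
    right = [x for u, x in zip(responses, criterion) if u == 1]
    wrong = [x for u, x in zip(responses, criterion) if u == 0]
    normal = NormalDist()
    y = normal.pdf(normal.inv_cdf(p))
    return (fmean(right) - fmean(wrong)) / pstdev(criterion) * (p * q / y)
```

Because the biserial assumes a normally distributed latent variable underlying the item score, it can exceed 1 in small or artificial samples; that is a property of the statistic, not of this sketch.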

Selection Rule 1 

Selection Rule 1 will consider only those of the 258 items 
that have conventional proportions-correct between .2 and .9, and 
r-biserials of at least .2. Of those items meeting these 
criteria, a random sample of 50 will be selected as the set of 
items to be seeded. This rule is only a slight modification of 
the previous rule. 

Selection Rule 2 

This Selection Rule will use the same relatively 
unrestrictive screening of the collection of 258 items, but will
randomly select 100 items as the set of items to be seeded. This 
is a greater departure from the previous rule in that twice as 
many items are now available for possible selection into the 
adaptive test pool. 

Selection Rule 3 

This Selection Rule will use a more restrictive screening.
We know that it is the middle difficulty items that will be most
used when the adaptive test is administered to a typical group.
It seems reasonable to capitalize on this knowledge. This
Selection Rule, like the others, will eliminate those items of the
258 with r-biserials less than .2. Then, 100 items will be 
selected for seeding whose proportions-correct are between .4 and 
.8, indicating that these items are about middle difficulty for 
5-choice items. 
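The three Selection Rules can be summarized as a screen followed by a random draw. The sketch below is illustrative only: the dictionary keys `p` and `rbis`, the function names, and the treatment of the screening bounds as inclusive are assumptions. When fewer items pass the screen than the rule calls for, the sketch simply returns all of them rather than guessing how a shortfall should be handled.

```python
import random

def screen(items, p_low, p_high, min_rbis=0.2):
    """Keep items whose conventional statistics pass the screen."""
    return [it for it in items
            if p_low <= it["p"] <= p_high and it["rbis"] >= min_rbis]

def select_seeded(items, rule, rng):
    """Return the set of items to seed under Selection Rule 1, 2, or 3."""
    if rule == 1:
        eligible, n = screen(items, 0.2, 0.9), 50   # moderate screen, 50 items
    elif rule == 2:
        eligible, n = screen(items, 0.2, 0.9), 100  # same screen, 100 items
    elif rule == 3:
        eligible, n = screen(items, 0.4, 0.8), 100  # middle difficulty only
    else:
        raise ValueError("rule must be 1, 2, or 3")
    # Random draw without replacement; capped at the number eligible.
    return rng.sample(eligible, min(n, len(eligible)))
```

A call such as `select_seeded(items, 2, random.Random(0))` draws a reproducible seeded set from a list of item records.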



At the end of the previous four Rounds of simulations, about 
30% of the final item pool consisted of items that were retained
from the initial item pool. All of these retained items had 
difficulties greater than 1.0 in absolute value. Although these 
items had been available for administration to 60,000 simulees by 
the end of Round 4, they had not accumulated sufficient responses 
to be among the 25 most used items at any Round. A different but 
overlapping 30% of the final item pool consisted of items with 
fewer than 1000 (and sometimes no) cumulative responses. Most of 
these items had estimated difficulties greater than 1.5 in
absolute value. To retain so many little used items in the face 
of the change in information structure of the pool for average 
examinees may be inefficient for adaptive testing with a typical 
group of simulees. It may be more efficient to shrink the 
effective ability range of interest. Two of the three elimination
rules proposed here capitalize on this idea. 

Elimination Rule 1 

This Elimination Rule is identical to that used in the 
previous study. The 25 items receiving the highest number of 
adaptive administrations will be eliminated from the pool and 
replacements selected for them. 

Elimination Rule 2 

The 25 items most used in the adaptive test simulations will 
be eliminated, as in Elimination Rule 1. In addition, the 25 
items least used in the adaptive test simulations will also be 
eliminated. A set of 50 replacement items will be selected.

Elimination Rule 3

As before, the 25 most used items will be eliminated. In 
addition, the 5 least used items will be eliminated and a set of 
30 replacement items will be selected. 
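The three Elimination Rules differ only in how many of the least used items are dropped alongside the 25 most used. A minimal sketch, assuming cumulative administration counts are held in a dictionary keyed by a hypothetical item identifier, and leaving ties to the sort order:

```python
def items_to_eliminate(exposure, rule):
    """Return (most_used, least_used) item ids to drop under a rule.

    exposure maps item id -> cumulative number of adaptive
    administrations. Rule 1 drops the 25 most used items; Rule 2 also
    drops the 25 least used; Rule 3 also drops the 5 least used.
    """
    n_least = {1: 0, 2: 25, 3: 5}[rule]
    ranked = sorted(exposure, key=exposure.get, reverse=True)
    most_used = ranked[:25]
    least_used = ranked[len(ranked) - n_least:]
    return most_used, least_used
```

The total number of replacement items needed is then `len(most_used) + len(least_used)`, that is, 25, 50, or 30.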

The Current Study 

The Data 

For purposes of this study, it was decided to focus on the 
item pools from Round 0 and Round 1 of the previous simulations. 
The change in information structure is largest between these 
pools, which were used for the first and second adaptive test 
simulations, respectively. While the method used to build the
initial Round 0 item pool produces an overly optimistic estimated
test information function for that pool, the change in the
characteristics of the pool from Round 0 to Round 1 is real.
Figure 3 shows the drop in the true information function for the
Round 1 pool when compared to that of the Round 0 pool.



Insert Figure 3 about here 



Implementation of Selection Rules 

Davis provided the data for the computation of conventional 
proportions-correct and r-biserials by simulating the 
administration of all 258 items to a random sample of 500 
simulees. These simulees were drawn from the same distribution of 
true ability used in the previous simulations. Figure 4 shows a 
scatterplot of the r-biserials against the proportions-correct for 
all 258 items. Approximately half of the 258 items are easy items 
with proportions-correct above .9. 



Insert Figure 4 about here

There were 118 items that met the criteria for inclusion for
Selection Rules 1 or 2, i.e., r-biserials of at least .2 and
proportions-correct between .2 and .9. From this set of items,
100 were chosen at random to form the set of Selection Rule 2
seeded items. Of these 100, a randomly chosen subset of 50 was
selected to be the seeded items for Selection Rule 1. Summary
statistics for both of these item sets are shown in Table 1. For
both sets of items, the correlation between proportions-correct
and r-biserials is moderately high. This suggests that the more 
difficult items are also more informative. 

Only 51 items met the criteria for inclusion for Selection 
Rule 3. To provide the necessary 100 items, 49 items were sampled 
randomly with replacement from the 51. Summary statistics for 
Selection Rule 3 items are also shown in Table 1. The correlation 
between proportions-correct and r-biserials is reduced, as are
standard deviations, when compared to the other item sets because
the range of proportions-correct is restricted.



Insert Table 1 about here

Adaptive Test Simulations

Davis simulated the administration of an adaptive test to a
sample of 30,000 simulees drawn from the distribution of ability
used in the previous study. In addition to the adaptive test,
each simulee responded to a random set of five anchor items (out
of 25) as required by the LOGIST-based method of on-line
calibration. Each of the first 15,000 simulees was seeded a 
random set of five of the 50 Selection Rule 1 items. All 30,000 
simulees were seeded random sets of five of the 100 Selection Rule 
2 items and also random sets of five of the 100 Selection Rule 3 
items. Thus each anchor item received about 6000 responses, and 
each of the items in the sets of seeded items received about 1500 
responses.
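The response counts quoted above follow from simple arithmetic: with random assignment, the expected number of responses per item is the number of simulees times the number of items each one receives, divided by the size of the set. The helper name below is hypothetical.

```python
def expected_responses(n_simulees, items_per_simulee, set_size):
    """Expected responses per item under random assignment."""
    return n_simulees * items_per_simulee / set_size

print(expected_responses(30_000, 5, 25))   # anchor items: 6000.0
print(expected_responses(15_000, 5, 50))   # Selection Rule 1 items: 1500.0
print(expected_responses(30_000, 5, 100))  # Selection Rule 2 or 3 items: 1500.0
```

Seeding the 50 Rule 1 items to only the first 15,000 simulees is what keeps the per-item count equal across the three seeded sets.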

On-line Calibrations 

Three separate on-line calibrations were performed using the
LOGIST-based anchor item approach developed for the previous
study, one for each set of seeded items associated with a
particular Selection Rule. The first LOGIST calibration used the
15,000 simulees who responded to Selection Rule 1 items as well as
the anchor items. The second LOGIST calibration used the 30,000
simulees responding to Selection Rule 2 items as well as the
anchor items. The final LOGIST calibration used the same 30,000
simulees, but only their responses to the anchor items and the
Selection Rule 3 item set. Characteristic curve scale 
transformations (Stocking and Lord, 1983) using the new item 
parameter estimates for the anchor items were then used to place 
the results of each calibration, independently, onto the scale of 
the Round 0 item pool. 
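To show how a linear scale transformation is applied to 3-parameter logistic estimates, the sketch below uses the simpler mean/sigma method on anchor item difficulties. This is a stand-in, not the characteristic curve method used in the study, which instead chooses the slope and intercept by comparing characteristic curves. All names are hypothetical.

```python
from statistics import fmean, pstdev

def mean_sigma(b_new, b_pool):
    """Slope A and intercept B mapping the new scale onto the pool scale.

    b_new:  anchor difficulties from the new calibration.
    b_pool: the same anchors' difficulties on the pool scale.
    """
    A = pstdev(b_pool) / pstdev(b_new)
    B = fmean(b_pool) - A * fmean(b_new)
    return A, B

def rescale(item, A, B):
    """Place one (a, b, c) estimate on the pool scale."""
    a, b, c = item
    # Discrimination is divided by A, difficulty is transformed
    # linearly, and the lower asymptote is unchanged.
    return (a / A, A * b + B, c)
```

Whatever method supplies A and B, the same `rescale` step would place each seeded item's estimates on the metric of the pool.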

The Choosing of Replacement Items 

The Elimination Rules studied mandate the discarding of 25,
50, or 30 items from the pool. The Selection Rules prescribe the
choice of sufficient items to maintain pool size from a set of 50
or one of two different sets of 100 candidate new items.
Regardless of the number of items to be discarded or the set from
which replacements were to be selected, the same algorithm was 
used to choose the appropriate number of replacement items from 
the set of seeded items. 

A 'target' information function was defined as the estimated 
test information function of the items discarded using Elimination 
Rule 1, that is, the 25 items most frequently used in the adaptive 
test simulation. The use of this target across Selection and 
Elimination Rules ensures that the space obtained in the pool by
discarding any little-used items will be utilized to select 
replacement items for the over-used items only. 
Two methods of choosing items to match the target information
function were employed. The first method chose items with the 
greatest area under their estimated item information functions 
within ability levels that appeared important based on the target 
information function. The second method chose items on the basis 
of the area under the estimated item information functions and 
then attempted to improve on this by discarding some items and 
selecting others that minimized the maximum difference between the 
target and the draft estimated test information functions. 
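The first method, and the criterion that the second method tries to reduce, can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually run in the study: items are 3-parameter logistic (a, b, c) tuples, D = 1.7, the area is approximated by the trapezoidal rule, and the ability limits `lo` and `hi` are supplied by the analyst. All names are hypothetical.

```python
import math

D = 1.7  # conventional logistic scaling constant (assumed)

def item_information(theta, a, b, c):
    """3PL item information at ability theta."""
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def area(item, lo, hi, steps=80):
    """Trapezoidal area under an item information function on [lo, hi]."""
    h = (hi - lo) / steps
    ys = [item_information(lo + i * h, *item) for i in range(steps + 1)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

def choose_by_area(candidates, n, lo, hi):
    """First method: the n candidates with the greatest area in [lo, hi]."""
    return sorted(candidates, key=lambda it: area(it, lo, hi), reverse=True)[:n]

def max_abs_difference(chosen, target, grid):
    """Largest gap between the draft set's information and a target.

    target is a function of theta; the second method swaps items in and
    out of the draft set to make this quantity smaller.
    """
    return max(abs(sum(item_information(t, *it) for it in chosen) - target(t))
               for t in grid)
```

As the text goes on to note, the choice of ability limits matters, so `lo` and `hi` would need adjusting by hand rather than being fixed in advance.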

Neither of these methods of choosing replacement items worked 
automatically without intervention. The replacement items were 
ultimately chosen on the basis of a subjective criterion: item
sets with estimated information functions closer to the target
over middle ranges of ability were preferable to item sets with 
estimated information functions more distant from the target in 
the middle but closer at the extremes. Both of the methods 
required tinkering with the ability limits within which a match to 
the target was desired. 

Results 

The sets of seeded items resulting from each Selection Rule 
were used with each Elimination Rule to develop a new 100-item
pool. That is, the 50-item set of seeded items resulting from
Selection Rule 1 was used as a source of 25, 50 and 30 replacement
items for Elimination Rules 1, 2 and 3 respectively. The same
pattern was repeated for the 100-item sets resulting from
Selection Rules 2 and 3. The effects of the different Selection 
and Elimination Rules were compared to each other through the use 
of the estimated test information function for the resulting
100-item pool. These results were also compared to the original Round
0 pool estimated test information function, as well as to the
previous Round 1 pool estimated test information function. 

Figure 5 shows the estimated test information functions for 
the sets of seeded items resulting from the three Selection Rules. 
These can be interpreted as showing what is available to work 
with, in terms of estimated information, when selecting the 
appropriate number of replacement items for each Elimination Rule. 
Also on the same plot is the target test information function for 
the 25 most used items in the Round 0 item pool. As expected, the 
estimated information function for Rule 3 is highest and 
narrowest; the conventional proportions-correct for the items
selected covered a fairly narrow range. Also as expected, the
shapes of the estimated information functions for Selection Rules
1 and 2 are similar, with the Rule 2 estimated information
function about twice as high as that for Rule 1 in the middle
ranges of abilities. This seems reasonable since the same
moderate screening was applied under both rules, and there are 
twice as many Rule 2 items as Rule 1 items. 



Insert Figure 5 about here 



Also shown on the same figure is the estimated test 
information function resulting from the random selection rule used 
in the previous simulations. While the Rule 1 set has the same 
number of items, it is clearly a more informative set of items 
than that chosen by the previous random selection rule. 

Selection Rules 

Figure 6 displays the results for Selection Rule 1 using each 
Elimination Rule, in terms of estimated information (top) and 
relative efficiencies (bottom) of the resultant 100-item pools.

The estimated information for the original Round 0 pool and the 
Round 1 pool produced using the random selection rule and 
elimination rule of the previous study are displayed on the graph 
for comparison. It seems clear that the moderate screening is 
effective. Replacing the 25 most used items (Elimination Rule 1) 
with 25 items that have been subjected to a moderate screening 
yields a higher estimated information function for middle ability 
levels than if the 25 items have not been screened. Replacing the
25 most used and 25 least used items in the pool (Elimination Rule
2) with all 50 of the moderately screened items is less
satisfactory. The estimated information is only slightly higher
than when replacing 25 items in the middle, but too high at higher 
abilities and too low at lower abilities. Replacing 30 rather 
than 25 items (Elimination Rule 3) is only a very slight 
improvement over replacing 25 items. 

Insert Figure 6 about here 



The same conclusions may be drawn from the relative 
efficiency graph. The efficiency of each pool constructed by the 
present rules and the previous rule is computed relative to the 
Round 0 pool.
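The paper does not give the formula behind the relative efficiency graphs. A common definition, assumed here, is the ratio of two information functions at each ability level, with the Round 0 pool in the denominator. The function name and inputs are hypothetical.

```python
def relative_efficiency(info_new, info_round0):
    """Pointwise ratio of a new pool's information to the Round 0 pool's.

    Both arguments are sequences of information values evaluated on
    the same grid of ability levels.
    """
    return [new / base for new, base in zip(info_new, info_round0)]
```

Under this definition, values below 1 mark ability levels where the new pool is less informative than the Round 0 pool, and values above 1 mark levels where it is more informative.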

Comparable plots of estimated information and relative 
efficiencies are displayed for Selection Rule 2 (Figure 7) and 
Selection Rule 3 (Figure 8). Selection Rule 2, which provides 100 
moderately screened seeded items, does substantially better than 
random selection. This is true even when only the 25 most used 
items in the Round 0 pool are identified for replacement 
(Elimination Rule 1). When 30 items are to be replaced 
(Elimination Rule 3), Selection Rule 2 produces a new pool that 
has nearly the same information structure as the original Round 0 
pool. When 50 items are to be replaced -- the 25 most used and 
the 25 least used (Elimination Rule 2) -- Selection Rule 2 
produces a pool that has higher test information for middle 
ability levels, and lower test information for extreme ability 
levels, when compared to the Round 0 pool. 



Insert Figures 7 and 8 about here 



Selection Rule 3, by providing 100 seeded items that have 
been subjected to a more restrictive screening, nearly matches the 
information structure of the Round 0 pool when either 25 or 30 items 
are replaced (Elimination Rules 1 and 3). When 50 items are 
replaced (Elimination Rule 2), the resultant new pool's 
information structure is changed to be sharply higher at middle 
ability levels and lower at extreme ability levels. 

Elimination Rules 

The three figures just examined show, for each Selection 
Rule, the consequences of each Elimination Rule. It is also 
informative to look at the same data along the other facet, that 
is, for each Elimination Rule, the consequences of each Selection 
Rule. Figures 9 through 11 show the new pools produced by each 
Selection Rule for Elimination Rules 1, 2 and 3 respectively. 



Insert Figures 9, 10 and 11 about here 



When only the 25 most used items are eliminated (Elimination 
Rule 1, Figure 9), seeded items from Selection Rule 3 provide the 
item pool most similar to the Round 0 pool. The results of all 
selection rules are, in fact, strictly ordered for middle ability 
levels in terms of information structure. The most different new 
pool is produced when the set of seeded items has been randomly 
selected. The most similar new pool is produced when the set of 
candidate items is larger, and has been subjected to fairly strict 
screening. This pool is nearly as good as the Round 0 pool in 
terms of estimated information for middle ability levels. 

When 50 items are to be eliminated (Elimination Rule 2), 
Figure 10 shows that selecting replacements from the larger item 
sets (Selection Rules 2 and 3) produces new pools that are more 
informative than the Round 0 pool for middle ability levels and 
less informative for more extreme ability levels. This may be 
undesirable because the information structure of the resultant 
pools is changed for almost all levels of ability. Selecting as 
replacements all 50 seeded items provided by Selection Rule 1 does 
not yield as much information as the Round 0 pool for middle or 
low ability levels, but yields more information for higher ability 
levels. 

A more moderate approach is to eliminate the 25 most used and 
the 5 least used items in the pool (Elimination Rule 3, Figure 
11). In terms of reproducing the estimated information structure 
of the Round 0 pool, selecting 30 items from 100 items that have 
been moderately screened (Selection Rule 2) or more strictly 
screened (Selection Rule 3) produces very similar results. Both of 
these approaches replicate the Round 0 estimated information 
structure well. Selecting 30 replacement items from only 50 
moderately screened seeded items provided by Selection Rule 1 
provides more information than the random selection approach, but 
still does not approximate the estimated information structure of 
the Round 0 pool very well. 
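
The three Elimination Rules compared above differ only in how many of the most used and least used items they flag. A minimal Python sketch of that step is given below; the function name and the representation of usage data as a dictionary of administration counts are assumptions made for illustration, while the item counts for each rule follow the descriptions in the text.

```python
def items_to_eliminate(usage_counts, n_most_used, n_least_used=0):
    """Return (most_used, least_used) lists of item ids flagged for replacement.

    usage_counts maps an item id to its number of administrations.
    """
    if n_most_used + n_least_used > len(usage_counts):
        raise ValueError("cannot flag more items than the pool contains")
    # Rank items from most to least administered; ties keep input order.
    ranked = sorted(usage_counts, key=usage_counts.get, reverse=True)
    most_used = ranked[:n_most_used]
    least_used = ranked[len(ranked) - n_least_used:]
    return most_used, least_used

# (number of most used, number of least used) items flagged by each rule.
ELIMINATION_RULES = {1: (25, 0), 2: (25, 25), 3: (25, 5)}

# Hypothetical usage counts for a 100-item pool, for illustration only.
usage = {f"item{i:03d}": i * 7 % 101 for i in range(100)}
most, least = items_to_eliminate(usage, *ELIMINATION_RULES[3])
print(len(most), len(least))
```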

Discussion 

The context of this study has been adaptive tests 
administered to examinees whose distribution of ability is 
bell-shaped. While this is probably the most common context in which 
adaptive testing is implemented, it should be noted that the 
details of the Selection and Elimination Rules studied here might 
be inappropriate if the distribution of examinee ability had a 
very different shape. Consider, for example, the situation in 
which the distribution of ability is U-shaped rather than 
bell-shaped. Then the screening on proportions-correct for the 
Selection Rules considered here eliminates exactly those items 
that are most likely to be useful. 

The criterion used to evaluate the operation of Selection and 
Elimination Rules was the information structure of a particular 
item pool. Although this item pool was built by the commonly 
accepted methods in adaptive testing, this pool would have been 
different if different items had been available for its 
construction. It is clear that the details of Selection and 
Elimination Rules should depend upon both the information 
structure of the criterion pool, and the distribution of examinee 
ability. 

Only three Selection Rules and three Elimination Rules were 
analyzed, and this analysis took place over only a single cycle of 
item pool refreshment. The rigid adherence to a fixed combination 
of a Selection Rule with an Elimination Rule over many cycles 
cannot be recommended. For example, if Elimination Rule 3, the 
elimination of the 25 most used items and the 5 least used items, 
were consistently used with any of the Selection Rules, over many 
cycles of item pool refreshment the effective range of ability 
over which the adaptive test measures well would shrink and no 
appropriate replacement items would be available. In practice, it 
seems better to maintain flexibility, and to choose Selection 
Rules and Elimination Rules for the next refreshment of the item 
pool on an ad-hoc basis by frequent examination of item pool 
statistics as adaptive testing proceeds. 

The Selection Rules studied all employ the screening of items 
on the basis of classical item statistics. This is more expensive 
than not screening items, as was done in the previous study, 
because of the necessary overproduction of items. The stricter 
the screening criteria, the greater the cost, as more of the items 
initially written will not meet the criteria. The benefit gained 
from the added expense of screening is the minimization of changes 
to the information structure of the item pool. 
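
The screening step and its cost can be sketched as follows. This is an illustrative Python outline only: the threshold arguments, the function names, and the example statistics are hypothetical and are not the criteria used by the Selection Rules in this study. The second function shows why stricter screening implies overproduction: if only a fraction of written items pass, proportionally more items must be written.

```python
import math

def screen_items(item_stats, p_min, p_max, r_bis_min):
    """Keep items whose classical statistics fall inside the screening limits.

    item_stats maps item id -> (proportion_correct, r_biserial).
    """
    kept = []
    for item_id, (p_correct, r_biserial) in item_stats.items():
        if p_min <= p_correct <= p_max and r_biserial >= r_bis_min:
            kept.append(item_id)
    return kept

def items_to_write(n_needed, pass_rate):
    """Approximate number of items to write so that n_needed pass screening."""
    return math.ceil(n_needed / pass_rate)

# Hypothetical pretest statistics, for illustration only.
stats = {"A": (0.60, 0.55), "B": (0.95, 0.60), "C": (0.50, 0.10), "D": (0.30, 0.45)}
print(screen_items(stats, p_min=0.20, p_max=0.90, r_bis_min=0.20))  # moderate limits
print(screen_items(stats, p_min=0.40, p_max=0.80, r_bis_min=0.20))  # stricter limits
```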

It seems clear that providing more items for seeding provides 
more flexibility in the choice of replacement items. This 
enhanced flexibility makes it easier to maintain the information 
structure of the pool, but it, too, incurs real-world costs. To 
collect the data for on-line calibration of more items requires, 
for a fixed number of examinees, that each examinee respond to 
more seeded items. If this is not feasible in terms of 
lengthening examinee testing time, then more examinees are 
required. This lengthens the time required to collect the data 
for on-line calibration. 

Eliminating over-exposed items from an adaptive test item 
pool seems a reasonable approach to maintaining test security. 
The definition of over-exposure used in this study was arbitrary 
-- the 25 most used items. No attempt was made to determine if 
this was reasonable. Other types of rules may function better in 
practice. For example, it may be more reasonable to set an 
absolute cut-off on the number of times an item can be 
administered before it is considered over-exposed. 
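
An absolute cut-off of this kind, which was not evaluated in this study, might be sketched in Python as below. The function name and the cut-off value are hypothetical. Unlike a rule that always flags the 25 most used items, the number of items flagged here varies with actual usage and may be zero.

```python
def over_exposed(usage_counts, max_administrations):
    """Return the ids of items administered more often than the cut-off."""
    return sorted(
        item_id
        for item_id, n_administered in usage_counts.items()
        if n_administered > max_administrations
    )

# Hypothetical administration counts, for illustration only.
usage = {"A": 120, "B": 40, "C": 300}
print(over_exposed(usage, max_administrations=100))
```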

The elimination of under-exposed items from the adaptive test 
item pool should be approached with caution, since this may reduce 
the effective range of the adaptive test. Careful consideration 
is needed to decide whether this reduction in range is tolerable, 
given the original purpose for which the adaptive test was 
constructed, and the potential benefits in terms of the 
information structure of the new pool. 

For the particular context of this study, the results suggest 
two approaches to the choice of a selection rule combined with an 
elimination rule for a single cycle of item pool refreshment. One 
approach would eliminate only over-exposed items (Elimination 
Rule 1) and choose replacements from 100 strictly screened seeded 
items (Selection Rule 3). A new item pool can be produced with 
almost the same information structure as the original pool. A 
second approach, one that should only be used with caution, would 
eliminate a small number of under-exposed items also (Elimination 
Rule 3). Replacements chosen from 100 moderately screened seeded 
items (Selection Rule 2) can result in a new pool that is also 
very similar in information structure to the original pool. 

This small study was not designed to examine a wide variety 
of selection and elimination rules in a variety of different 
contexts. However, based on the results, two more general 
conclusions are suggested: 

1) Using conventional item statistics to screen items before 
deciding to seed them seems important and effective in terms of 
maintaining the information structure of the adaptive test item 
pool. The details of the screening criteria must depend upon the 
particular item pool and the examinees for whom the adaptive test 
is intended. 

2) The on-line calibration of larger sets of seeded items 
from which to select replacements can substantially improve the 
ease with which the information structure of the pool can be 
maintained by providing added flexibility in the choice of 
replacement items. 

Table 1 

Summary Statistics for Proportions-Correct and r-Biserials 
on the Item Sets Produced by the Three Selection Rules 

Selection Rule 1. n = 50 

                                                  Percentiles 
                      Mean  S.D.  Min   Max     10    25    50    75    90 
proportion-correct    .58   .24   .21   .90     .27   .32   .63   .83   .87 
r-biserial            .53   .16   .23   .73     .26   .40   .57   .66   .71 

Correlation between proportion-correct and r-bis = .51 

Selection Rule 2. n = 100 

                                                  Percentiles 
                      Mean  S.D.  Min   Max     10    25    50    75    90 
proportion-correct    .58   .23   .21   .90     .27   .34   .63   .80   .87 
r-biserial            .54   .16   .23   .84     .30   .40   .58   .67   .72 

Correlation between proportion-correct and r-bis = .55 

Selection Rule 3. n = 100 

                                                  Percentiles 
                      Mean  S.D.  Min   Max     10    25    50    75    90 
proportion-correct    .61   .12   .42   .80     .44   .48   .63   .70   .78 
r-biserial            .60   .13   .28   .84     .39   .52   .62   .68   .74 

Correlation between proportion-correct and r-bis = .33 

Figure 1. Estimated test information functions for the 100-item 
pools at each Round of the previous simulations for the 
LOGIST-based method of on-line calibration. 

Figure 2. Estimated test information functions for the set 
of 50 randomly selected candidate new items and the 25 replacement 
items selected from the 50 for the refreshment of the Round 0 pool 
in the previous simulations. 

Figure 3. True test information functions for the Round 0 
pool (solid line) and Round 1 pool (dotted line) from the previous 
simulations. 

Figure 4. Scatterplot of r-biserials (vertical axis) and 
proportions-correct (horizontal axis) for the set of 258 items. 
The two vertical lines mark the less restrictive limits on 
proportions-correct for Selection Rules 1 and 2. The horizontal 
line marks the limit on r-biserials used for all three Selection 
Rules. 

Figure 5. Estimated test information functions for the sets 
of seeded items resulting from the three Selection Rules of the 
current study and the random selection rule of the previous study. 
Also shown is the target test information function for the 25 most 
used items in the Round 0 item pool. 

Figure 6. For Selection Rule 1: Estimated test information 
functions for the 100-item pools resulting from each Elimination 
Rule of the current study, the Round 1 pool of the previous study, 
and for the Round 0 pool (top); efficiency of each 100-item pool 
relative to the Round 0 pool (bottom). 

Figure 7. For Selection Rule 2: Estimated test information 
functions for the 100-item pools resulting from each Elimination 
Rule of the current study, the Round 1 pool of the previous study, 
and for the Round 0 pool (top); efficiency of each 100-item pool 
relative to the Round 0 pool (bottom). 

Figure 8. For Selection Rule 3: Estimated test information 
functions for the 100-item pools resulting from each Elimination 
Rule of the current study, the Round 1 pool of the previous study, 
and for the Round 0 pool (top); efficiency of each 100-item pool 
relative to the Round 0 pool (bottom). 

Figure 9. For Elimination Rule 1: Estimated test 
information functions for the 100-item pools resulting from each 
Selection Rule of the current study, the Round 1 pool of the 
previous study, and for the Round 0 pool (top); efficiency of each 
100-item pool relative to the Round 0 pool (bottom). 

Figure 10. For Elimination Rule 2: Estimated test 
information functions for the 100-item pools resulting from each 
Selection Rule of the current study, the Round 1 pool of the 
previous study, and for the Round 0 pool (top); efficiency of each 
100-item pool relative to the Round 0 pool (bottom). 

Figure 11. For Elimination Rule 3: Estimated test 
information functions for the 100-item pools resulting from each 
Selection Rule of the current study, the Round 1 pool of the 
previous study, and for the Round 0 pool (top); efficiency of each 
100-item pool relative to the Round 0 pool (bottom). 